<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
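	<title>Pattern Analyzer | ElasticSearch 7.7 Definitive Guide, Chinese Edition</title>
	<meta name="keywords" content="ElasticSearch Definitive Guide Chinese Edition, elasticsearch 7, es7, real-time data analysis, real-time data search" />
    <meta name="description" content="ElasticSearch Definitive Guide Chinese Edition, elasticsearch 7, es7, real-time data analysis, real-time data search" />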
    <!-- Give IE8 a fighting chance -->
    <!--[if lt IE 9]>
    <script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
    <script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
    <![endif]-->
	<link rel="stylesheet" type="text/css" href="../static/styles.css" />
	<script>
	var _link = 'analysis-pattern-analyzer.html';
    </script>
</head>
<body>
<div class="main-container">
    <section id="content">
        <div class="content-wrapper">
            <section id="guide" lang="zh_cn">
                <div class="container">
                    <div class="row">
                        <div class="col-xs-12 col-sm-8 col-md-8 guide-section">
                            <div style="color:gray; word-break: break-all; font-size:12px;">Original English version: <a href="https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-pattern-analyzer.html" rel="nofollow" target="_blank">https://www.elastic.co/guide/en/elasticsearch/reference/7.7/analysis-pattern-analyzer.html</a>, copyright of the original documentation belongs to www.elastic.co<br/>Local English copy: <a href="../en/analysis-pattern-analyzer.html" rel="nofollow" target="_blank">../en/analysis-pattern-analyzer.html</a></div>
                        <!-- start body -->
                  <div class="page_header">
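<strong>Important</strong>: No additional bug fixes or documentation updates will be released for this version. For the latest information, see the <a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html" rel="nofollow">current release documentation</a>.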
</div>
<div id="content">
<div class="navheader">
<span class="prev">
<a href="analysis-lang-analyzer.html">« Language Analyzers</a>
</span>
<span class="next">
<a href="analysis-simple-analyzer.html">Simple Analyzer »</a>
</span>
</div>
<div class="section">
<div class="titlepage"><div><div>
<h2 class="title">
<a id="analysis-pattern-analyzer"></a>Pattern Analyzer
</h2>
</div></div></div>
<p>The <code class="literal">pattern</code> analyzer uses a regular expression to split the text into terms.
The regular expression should match the <span class="strong strong"><strong>token separators</strong></span>, not the tokens
themselves. The regular expression defaults to <code class="literal">\W+</code> (one or more non-word characters).</p>
<div class="warning admon">
<div class="icon"></div>
<div class="admon_content">
<h3>Beware of Pathological Regular Expressions</h3>
<p>The pattern analyzer uses
<a href="http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html" class="ulink" target="_top">Java Regular Expressions</a>.</p>
<p>A badly written regular expression could run very slowly or even throw a
StackOverflowError and cause the node it is running on to exit suddenly.</p>
<p>Read more about <a href="http://www.regular-expressions.info/catastrophic.html" class="ulink" target="_top">pathological regular expressions and how to avoid them</a>.</p>
</div>
</div>
<h3>
<a id="_example_output_3"></a>Example output
</h3>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}</pre>
</div>
<p>The above sentence would produce the following terms:</p>
<div class="pre_wrapper lang-text">
<pre class="programlisting prettyprint lang-text">[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]</pre>
</div>
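<p>The splitting behaviour can be approximated outside Elasticsearch. The following is a minimal sketch in Python, not the analyzer's actual implementation: it assumes that Python's <code class="literal">re</code> module treats <code class="literal">\W+</code> the same way Java does for this ASCII input, and the helper name <code class="literal">pattern_split</code> is hypothetical.</p>

```python
import re

def pattern_split(text, pattern=r"\W+", lowercase=True):
    """Split text wherever the separator pattern matches; drop empty tokens."""
    tokens = [t for t in re.split(pattern, text) if t]
    return [t.lower() for t in tokens] if lowercase else tokens

terms = pattern_split("The 2 QUICK Brown-Foxes jumped over the lazy dog's bone.")
print(terms)
```

<p>For the sentence above, the sketch yields the same twelve terms as the analyzer. The trailing period produces an empty string after the last split, which is why empty tokens are filtered out.</p>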
<h3>
<a id="_configuration_4"></a>Configuration
</h3>
<p>The <code class="literal">pattern</code> analyzer accepts the following parameters:</p>
<div class="informaltable">
<table border="0" cellpadding="4px">
<colgroup>
<col>
<col>
</colgroup>
<tbody valign="top">
<tr>
<td valign="top">
<p>
<code class="literal">pattern</code>
</p>
</td>
<td valign="top">
<p>
A <a href="http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html" class="ulink" target="_top">Java regular expression</a>, defaults to <code class="literal">\W+</code>.
</p>
</td>
</tr>
<tr>
<td valign="top">
<p>
<code class="literal">flags</code>
</p>
</td>
<td valign="top">
<p>
Java regular expression <a href="http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#field.summary" class="ulink" target="_top">flags</a>.
Flags should be pipe-separated, e.g. <code class="literal">"CASE_INSENSITIVE|COMMENTS"</code>.
</p>
</td>
</tr>
<tr>
<td valign="top">
<p>
<code class="literal">lowercase</code>
</p>
</td>
<td valign="top">
<p>
Whether terms should be lowercased. Defaults to <code class="literal">true</code>.
</p>
</td>
</tr>
<tr>
<td valign="top">
<p>
<code class="literal">stopwords</code>
</p>
</td>
<td valign="top">
<p>
A predefined stop word list such as <code class="literal">_english_</code>, or an array containing a
list of stop words. Defaults to <code class="literal">_none_</code>.
</p>
</td>
</tr>
<tr>
<td valign="top">
<p>
<code class="literal">stopwords_path</code>
</p>
</td>
<td valign="top">
<p>
The path to a file containing stop words.
</p>
</td>
</tr>
</tbody>
</table>
</div>
<p>See the <a class="xref" href="analysis-stop-tokenfilter.html" title="Stop token filter">Stop Token Filter</a> for more information
about stop word configuration.</p>
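<p>To illustrate how these parameters interact, here is a rough Python sketch. It is an approximation under stated assumptions: the <code class="literal">FLAG_NAMES</code> table maps a few Java flag names to their nearest Python <code class="literal">re</code> equivalents, the function names are hypothetical, and lowercasing is assumed to run before stop word removal, as the filter order in the Definition section suggests.</p>

```python
import re

# Assumed mapping from Java flag names to the closest Python re flags.
FLAG_NAMES = {
    "CASE_INSENSITIVE": re.IGNORECASE,
    "COMMENTS": re.VERBOSE,
    "MULTILINE": re.MULTILINE,
    "DOTALL": re.DOTALL,
}

def parse_flags(spec):
    """Combine a pipe-separated flag string into one re flag value."""
    value = 0
    for name in filter(None, spec.split("|")):
        value |= FLAG_NAMES[name]
    return value

def analyze(text, pattern=r"\W+", flags="", lowercase=True, stopwords=()):
    """Split on the pattern, optionally lowercase, then drop stop words."""
    tokens = [t for t in re.split(pattern, text, flags=parse_flags(flags)) if t]
    if lowercase:
        tokens = [t.lower() for t in tokens]
    stop = set(stopwords)
    return [t for t in tokens if t not in stop]

by_flag = analyze("oneXtwoxthree", pattern="x", flags="CASE_INSENSITIVE")
no_stops = analyze("The quick fox and the dog", stopwords=["the", "and"])
print(by_flag, no_stops)
```

<p>In this sketch the empty default for <code class="literal">stopwords</code> plays the role of <code class="literal">_none_</code>: no terms are removed unless a list is supplied.</p>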
<h3>
<a id="_example_configuration_3"></a>Example configuration
</h3>
<p>In this example, we configure the <code class="literal">pattern</code> analyzer to split email addresses
on non-word characters or on underscores (<code class="literal">\W|_</code>), and to lower-case the result:</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", <a id="CO409-1"></a><i class="conum" data-value="1"></i>
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO409-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>The backslashes in the pattern need to be escaped when specifying the
pattern as a JSON string.</p>
</td>
</tr>
</table>
</div>
<p>The above example produces the following terms:</p>
<div class="pre_wrapper lang-text">
<pre class="programlisting prettyprint lang-text">[ john, smith, foo, bar, com ]</pre>
</div>
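<p>The escaping rule can be checked with any JSON parser. The sketch below, in Python, decodes a hand-written copy of the analyzer settings and applies the resulting pattern with Python's <code class="literal">re</code> module; it assumes that <code class="literal">\W|_</code> behaves the same in Python and Java for this ASCII input, and the variable names are illustrative only.</p>

```python
import json
import re

# The settings as they appear in the request body: the backslash is doubled.
body = r'{"type": "pattern", "pattern": "\\W|_", "lowercase": true}'
settings = json.loads(body)

# After JSON decoding, the regular expression holds a single backslash.
pattern = settings["pattern"]
print(pattern)  # \W|_

tokens = [t.lower() for t in re.split(pattern, "John_Smith@foo-bar.com") if t]
print(tokens)
```

<p>Writing the pattern with a single backslash in the request body would give <code class="literal">\W</code>, which is not a valid JSON escape sequence, so the request would be rejected by a strict JSON parser.</p>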
<h4>
<a id="_camelcase_tokenizer"></a>CamelCase tokenizer
</h4>
<p>The following more complicated example splits CamelCase text into tokens:</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?&lt;=\\D)(?=\\d)|(?&lt;=\\d)(?=\\D)|(?&lt;=[\\p{L}&amp;&amp;[^\\p{Lu}]])(?=\\p{Lu})|(?&lt;=\\p{Lu})(?=\\p{Lu}[\\p{L}&amp;&amp;[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}</pre>
</div>
<p>The above example produces the following terms:</p>
<div class="pre_wrapper lang-text">
<pre class="programlisting prettyprint lang-text">[ moose, x, ftp, class, 2, beta ]</pre>
</div>
<p>The regex above is easier to understand as:</p>
<div class="pre_wrapper lang-regex">
<pre class="programlisting prettyprint lang-regex">  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?&lt;=\D)(?=\d)                 # or non-number followed by number,
| (?&lt;=\d)(?=\D)                 # or number followed by non-number,
| (?&lt;=[ \p{L} &amp;&amp; [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?&lt;=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&amp;&amp;[^\p{Lu}]]          #   then lower case
  )</pre>
</div>
<h3>
<a id="_definition_3"></a>Definition
</h3>
<p>The <code class="literal">pattern</code> analyzer consists of:</p>
<div class="variablelist">
<dl class="variablelist">
<dt>
<span class="term">
Tokenizer
</span>
</dt>
<dd>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
<a class="xref" href="analysis-pattern-tokenizer.html" title="Pattern Tokenizer">Pattern Tokenizer</a>
</li>
</ul>
</div>
</dd>
<dt>
<span class="term">
Token Filters
</span>
</dt>
<dd>
<div class="ulist itemizedlist">
<ul class="itemizedlist">
<li class="listitem">
<a class="xref" href="analysis-lowercase-tokenfilter.html" title="Lowercase token filter">Lower Case Token Filter</a>
</li>
<li class="listitem">
<a class="xref" href="analysis-stop-tokenfilter.html" title="Stop token filter">Stop Token Filter</a> (disabled by default)
</li>
</ul>
</div>
</dd>
</dl>
</div>
<p>If you need to customize the <code class="literal">pattern</code> analyzer beyond its configuration
parameters, recreate it as a <code class="literal">custom</code> analyzer and modify it, usually by
adding token filters. The following request recreates the built-in
<code class="literal">pattern</code> analyzer, which you can use as a starting point for further
customization:</p>
<div class="pre_wrapper lang-console">
<pre class="programlisting prettyprint lang-console">PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type":       "pattern",
          "pattern":    "\\W+" <a id="CO410-1"></a><i class="conum" data-value="1"></i>
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"       <a id="CO410-2"></a><i class="conum" data-value="2"></i>
          ]
        }
      }
    }
  }
}</pre>
</div>
<div class="calloutlist">
<table border="0" summary="Callout list">
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO410-1"><i class="conum" data-value="1"></i></a></p>
</td>
<td align="left" valign="top">
<p>The default pattern is <code class="literal">\W+</code>, which splits on non-word characters.
Change it here to use a different pattern.</p>
</td>
</tr>
<tr>
<td align="left" valign="top" width="5%">
<p><a href="#CO410-2"><i class="conum" data-value="2"></i></a></p>
</td>
<td align="left" valign="top">
<p>You’d add other token filters after <code class="literal">lowercase</code>.</p>
</td>
</tr>
</table>
</div>
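<p>The tokenizer-plus-filters structure can be pictured as a simple function chain. The following Python sketch is a conceptual model only, not how Elasticsearch builds analyzers: <code class="literal">build_analyzer</code> and the <code class="literal">unique</code> filter are hypothetical names introduced here to show where an extra filter would slot in after <code class="literal">lowercase</code>.</p>

```python
import re

def split_on_non_word(text):
    # Stand-in for the pattern tokenizer with the default pattern.
    return [t for t in re.split(r"\W+", text) if t]

def lowercase(tokens):
    return [t.lower() for t in tokens]

def unique(tokens):
    # Hypothetical extra filter: keep only the first occurrence of each token.
    seen = set()
    kept = []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            kept.append(t)
    return kept

def build_analyzer(tokenizer, filters):
    """Return a function that tokenizes, then applies each filter in order."""
    def analyze(text):
        tokens = tokenizer(text)
        for token_filter in filters:
            tokens = token_filter(tokens)
        return tokens
    return analyze

rebuilt_pattern = build_analyzer(split_on_non_word, [lowercase])
extended = build_analyzer(split_on_non_word, [lowercase, unique])
print(rebuilt_pattern("The QUICK the"), extended("The QUICK the"))
```

<p>Filter order matters in this model: because <code class="literal">unique</code> runs after <code class="literal">lowercase</code>, "The" and "the" count as the same token.</p>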
</div>
</div>

                  <!-- end body -->
                        </div>
                        <div class="col-xs-12 col-sm-4 col-md-4" id="right_col">
                        
                        </div>
                    </div>
                </div>
            </section>
        </div>
    </section>
</div>
<script src="../static/cn.js"></script>
</body>
</html>